Comparing source and country inventories#
The Climate TRACE dataset provides two sets of inventories:
tracking every source of emissions
using econonomic output statistics to derive country-level emissions
The second approach is described in more details in (TODO: ref to the implicit doc).
This chapter compares the two approaches and identifies gaps between the two results. We conclude with a statistical approach to combine the two estimates into a single, more precise estimate, using Bayesian statistics.
%load_ext autoreload
%autoreload 2
import polars as pl
pl.enable_string_cache() # TODO: remove
import plotly.express as px
import plotly.graph_objects as go # Will use more advanced visualization functions
from ctrace.constants import *
import ctrace as ct
We first load the data for the year 2022, only focusing on CO2-equivalent emissions.
sdf_gy is the dataframe of the source emissions for this gas and this year.
year = 2022
gas = "co2e_100yr"
cedf = ct.read_country_emissions()
sdf = ct.read_source_emissions(year=year)
sdf_gy = sdf.filter(c_gas == gas)
cedf_gy = cedf.filter(c_gas == gas).filter(c_start_time.dt.year()==year)
Comparing visually the two datasets#
Let us check first if the numbers roughly match for each country and for each sector. Are we in the same ballpark estimate between the tracking of the sources and country-level aggregations?
First aggregate the emissions by country, sector and sub-sector:
cedf_gy_agg = (cedf_gy
.group_by(c_iso3_country, c_ct_package, c_ct_file)
.agg(c_emissions_quantity.sum())
)
sdf_gy_agg = (sdf_gy
.group_by(c_iso3_country, c_ct_package, c_ct_file)
.agg(c_emissions_quantity.sum())
.collect()
)
The jdf dataframe is the joint dataframe holding estimates for each sector, country, sub-sector.
Some sectors and subsectors will be missing in one or the other dataframe, depending
on what countries have reported.
Technical: we are using an outer join to capture both sides of the data, even if one is
missing.
EMISSIONS_QUANTITY_S = EMISSIONS_QUANTITY + "_s"
EMISSIONS_QUANTITY_CE = EMISSIONS_QUANTITY + "_ce"
jdf = (cedf_gy_agg
.join(sdf_gy_agg, on=[ISO3_COUNTRY, CT_PACKAGE, CT_FILE], suffix="_s", how="outer")
.with_columns(
c_iso3_country.fill_null(C(ISO3_COUNTRY+"_s")),
c_ct_package.fill_null(C(CT_PACKAGE+"_s")),
c_ct_file.fill_null(C(CT_FILE+"_s")),
c_emissions_quantity.alias(EMISSIONS_QUANTITY_CE),
).select(
c_iso3_country,
c_ct_package,
c_ct_file,
EMISSIONS_QUANTITY_S,
EMISSIONS_QUANTITY_CE
))
jdf
| iso3_country | ct_package | ct_file | emissions_quantity_s | emissions_quantity_ce |
|---|---|---|---|---|
| enum | cat | cat | f64 | f64 |
| "AIA" | "agriculture" | "other-agricult… | null | 0.0 |
| "ATF" | "agriculture" | "other-agricult… | null | 0.0 |
| "ALB" | "agriculture" | "other-agricult… | null | 439610.5 |
| "CHL" | "agriculture" | "other-agricult… | null | 2639898.2 |
| "CMR" | "agriculture" | "other-agricult… | null | 2988114.3 |
| … | … | … | … | … |
| "USA" | "fossil_fuel_op… | "oil-and-gas-pr… | 8.4849e8 | null |
| "ZNC" | "forestry_and_l… | "net-shrubgrass… | -82226.761935 | null |
| "SAU" | "agriculture" | "synthetic-fert… | 2660.040524 | null |
| "KHM" | "agriculture" | "synthetic-fert… | 69423.989415 | null |
| "AND" | "agriculture" | "manure-left-on… | 0.0 | null |
Quick check where we have differences in missing data:
CTODO
Shouldn’t the sources be a subset of all the data reported by the countries? I would assume that the countries and EDGAR / FAOSTATS are more comprehensive at country level than CTrace.
(jdf.group_by(
C(EMISSIONS_QUANTITY_S).is_not_null().alias("present_s"),
C(EMISSIONS_QUANTITY_CE).is_not_null().alias("present_ce"))
.agg(c_iso3_country.count())
)
| present_s | present_ce | iso3_country |
|---|---|---|
| bool | bool | u32 |
| true | true | 4835 |
| false | true | 7997 |
| true | false | 84 |
# jdf.filter(C(EMISSIONS_QUANTITY_S).is_not_null() & C(EMISSIONS_QUANTITY_CE).is_null())
(jdf
.with_columns(((C(EMISSIONS_QUANTITY_S) > 0) | (C(EMISSIONS_QUANTITY_CE) > 0)).alias("is_emission"))
.group_by("is_emission")
.agg(C(EMISSIONS_QUANTITY_S).sum(), C(EMISSIONS_QUANTITY_CE).sum()))
| is_emission | emissions_quantity_s | emissions_quantity_ce |
|---|---|---|
| bool | f64 | f64 |
| false | -5.6840e10 | -2.1257e10 |
| true | 9.4461e10 | 7.6428e10 |
| null | -461722.074512 | 0.0 |
To better compare, we also add a ratio between the source-defined emissions and the country-defined emissions. If we knew perfectly the emissions, this number would be one. However:
the sources should be just an observed fraction of all the emissions, so this fraction should generally be expected to be less than one
the emissions are known only to some degree of accuracy. There is noise in the measurement. As a result, the fraction should be wiggling around one.
Some sectors are just not observed at all in some countries, so having infinite values is expected.
jdf_cmp = (jdf
.with_columns((C(EMISSIONS_QUANTITY_S).fill_null(0.0)/C(EMISSIONS_QUANTITY_CE).fill_null(0.0)).alias("found"))
)
jdf_cmp
| iso3_country | ct_package | ct_file | emissions_quantity_s | emissions_quantity_ce | found |
|---|---|---|---|---|---|
| enum | cat | cat | f64 | f64 | f64 |
| "AIA" | "agriculture" | "other-agricult… | null | 0.0 | NaN |
| "ATF" | "agriculture" | "other-agricult… | null | 0.0 | NaN |
| "ALB" | "agriculture" | "other-agricult… | null | 439610.5 | 0.0 |
| "CHL" | "agriculture" | "other-agricult… | null | 2639898.2 | 0.0 |
| "CMR" | "agriculture" | "other-agricult… | null | 2988114.3 | 0.0 |
| … | … | … | … | … | … |
| "USA" | "fossil_fuel_op… | "oil-and-gas-pr… | 8.4849e8 | null | inf |
| "ZNC" | "forestry_and_l… | "net-shrubgrass… | -82226.761935 | null | -inf |
| "SAU" | "agriculture" | "synthetic-fert… | 2660.040524 | null | inf |
| "KHM" | "agriculture" | "synthetic-fert… | 69423.989415 | null | inf |
| "AND" | "agriculture" | "manure-left-on… | 0.0 | null | NaN |
Plotting one set of emissions against the other, what do we see?
First of all, the difference between various sources is massive (6 orders of magnitude between China’s power generation and the UK’s petland fires).
Also, as expected, most of the source-based emissions are lower than the more comprehensive country-based emissions estimates.
There seems to be a data issues with the forestry and the agriculture datasets. The source-based emissions are exactly triple the values reported in the country-based emissions.
CTODO
Verify with CT about the values there. Maybe they know about it already?
The US Virgin Islands seem to undereport a manufacturing activity.
CTODO
Verify with CT about the values there. Maybe they know about it already?
The Feroe islands seem to be have big discrepancies in their forest management
CTODO
Verify with CT about the values there. Maybe they know about it already?
fig1 = px.scatter(jdf_cmp, x=EMISSIONS_QUANTITY_CE, y=EMISSIONS_QUANTITY_S,
color=CT_PACKAGE,
hover_name=ISO3_COUNTRY,
hover_data=[CT_PACKAGE, CT_FILE])
fig2 = px.line(jdf_cmp, x=EMISSIONS_QUANTITY_S, y=EMISSIONS_QUANTITY_S).update_traces(marker=dict(
color='red'))
fig = go.Figure(data = fig1.data + fig2.data)
(fig
.update_xaxes(type="log")
.update_yaxes(type="log")
.update_layout(xaxis_title=EMISSIONS_QUANTITY_CE, yaxis_title=EMISSIONS_QUANTITY_S))
px.histogram(jdf_cmp, x="found",
color=CT_PACKAGE,
range_x=[0,3],
nbins=5000)
Combining both estimates with Bayesian analysis#
So far, we have been taking the figures for granted. Can we have a more precise estimate for each sector? Can we put some uncertainty margins around these numbers?
I am going to show in this section how Bayesian analysis can help. Our goal is: for a given sector, given the reports from the sources and the report from country inventories, can we have a better guess at the true value of emissions?
We will take some values from the manufacturing sector, but other sectors can also be considered.
import arviz as az
import pymc as pm
import numpy as np
import scipy
import logging
logging.getLogger("matplotlib").setLevel("WARNING")
logging.getLogger("filelock").setLevel("WARNING")
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
subs_df = (jdf_cmp
.filter(c_ct_package == "manufacturing")
.filter((C("found") > 0.1) & (C("found") < 10))
.drop_nulls()
.sort(by=EMISSIONS_QUANTITY_S))
# px.histogram(subs_df, x=EMISSIONS_QUANTITY_CE,
# hover_name=ISO3_COUNTRY,
# # log_x=True,
# # range_x=[0,3],
# nbins=500)
subs_df
| iso3_country | ct_package | ct_file | emissions_quantity_s | emissions_quantity_ce | found |
|---|---|---|---|---|---|
| enum | cat | cat | f64 | f64 | f64 |
| "CHL" | "manufacturing" | "chemicals" | 2885.0 | 2885.0 | 1.0 |
| "SYR" | "manufacturing" | "steel" | 3744.0 | 3744.0 | 1.0 |
| "VEN" | "manufacturing" | "chemicals" | 6396.0 | 6396.0 | 1.0 |
| "VEN" | "manufacturing" | "steel" | 6564.0 | 6564.0 | 1.0 |
| "TCD" | "manufacturing" | "chemicals" | 13428.0 | 13428.0 | 1.0 |
| … | … | … | … | … | … |
| "CHN" | "manufacturing" | "aluminum" | 2.13497907e8 | 2.13497907e8 | 1.0 |
| "IND" | "manufacturing" | "cement" | 2.52596928e8 | 2.58232689e8 | 0.978176 |
| "CHN" | "manufacturing" | "chemicals" | 2.6414734e8 | 2.8985344e8 | 0.911313 |
| "CHN" | "manufacturing" | "cement" | 7.046727e8 | 1.1672e9 | 0.603719 |
| "CHN" | "manufacturing" | "steel" | 9.22971074e8 | 1.2870e9 | 0.717133 |
Here is the model that we will assume. Suppose that the true emission of a given sector has a normalized value of 1GT CO2e. Countries have an incentive to under-report the true value, because they may have more commitments to reduce their emissions in the future. Emissions are hard to measure in general, so undereporting is more likeley than overreporting. It makes more sense to use a statistical distribution that is more skewed towards lower values, that is, an asymetrical distribution. Also, emissions should be positive for the industry. Negative values are not likely.
This is why a first example of a model is this shifted-flipped Gamma distribution. We may argue about the exact parameters and the margins of errors, but this should give reasonable results as a start.
Let’s assume that this is also the expected shape for the source emissions, for similar reasons:
we do not know precisely how much we missed
for what we are accounting, there is also a margin of error
This model can be refined based on expert knowledge (we may have further information about how much was missed for a sector).
def shiftedGamma(mean, size):
# Picking some sensible parameters
alpha = 3.0
beta = 2.0
# mode = (alpha-1) / beta # Putting the observed value on the mode of the distribution
mean_ = (alpha) / beta
# Mapping 10.m -> 0, m -> 1.0
return mean * (4/3. - (1.0/3) * (1.0 / mean_) * pm.Gamma.dist(alpha=alpha, beta = beta, size=size))
x = shiftedGamma(mean=1.0, size=1)
xs = np.linspace(0,2,100)
ys = pm.logp(x, xs).eval()
px.line(x=xs,y=np.exp(ys))
Armed with this first model, we define our statistical inference problem: given two observations of the emissions (per country and per source), what is the likely distribution of true emission values?
Here is an implementation of this model in PyMC, a popular library for Bayesian analysis.
basic_model = pm.Model()
# The values of the emissions per source and per country
em_s = np.array(subs_df[EMISSIONS_QUANTITY_S])
em_ce = np.array(subs_df[EMISSIONS_QUANTITY_CE])
# start_pt = scipy.stats.gmean([em_s, em_ce], axis=0)
# Given the wide range of scales, normalizing them around the source emissions
norm_factor = em_s
em_s_normed = em_s / norm_factor
em_ce_normed = em_ce / norm_factor
start_pt_normed = scipy.stats.gmean([em_s_normed, em_ce_normed], axis=0)
with basic_model:
# The initial value: the geometric mean of both estimate
init_est = start_pt_normed
# The Gamma distribution is only defined over positive values, so we need to clamp the values of emissions to prevent
# PyMC from sampling values with zero probability.
max_bound = np.maximum(em_s_normed, em_ce_normed)
epsi = pm.TruncatedNormal("epsi", mu = init_est, sigma=1.5 * init_est, lower=max_bound * 0.75, upper=max_bound * 5.0)
# TODO: maybe just use gamma distributions directly?
cp = pm.CustomDist("cp", epsi, dist=shiftedGamma, observed=em_ce_normed)
ip = pm.CustomDist("ip", epsi, dist=shiftedGamma, observed=em_s_normed)
We run the sampler with 4 chains. The default parameters should work out of the box for such a simple model.
num_chains = 4
with basic_model:
idata = pm.sample(chains=num_chains)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [epsi]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.
# Not plotting, too many variables to plot
# az.plot_trace(idata, combined=True);
summ = az.summary(idata, round_to=2,stat_funcs={"median":lambda z:np.percentile(z,50)})
summ
| mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | median | |
|---|---|---|---|---|---|---|---|---|---|---|
| epsi[0] | 0.96 | 0.14 | 0.79 | 1.19 | 0.0 | 0.0 | 9968.17 | 2304.78 | 1.00 | 0.93 |
| epsi[1] | 0.96 | 0.13 | 0.78 | 1.19 | 0.0 | 0.0 | 7688.35 | 2447.17 | 1.00 | 0.93 |
| epsi[2] | 0.97 | 0.14 | 0.79 | 1.20 | 0.0 | 0.0 | 7450.36 | 2345.43 | 1.00 | 0.93 |
| epsi[3] | 0.96 | 0.15 | 0.79 | 1.19 | 0.0 | 0.0 | 9298.34 | 2360.97 | 1.00 | 0.93 |
| epsi[4] | 0.96 | 0.13 | 0.78 | 1.20 | 0.0 | 0.0 | 8744.27 | 1989.44 | 1.00 | 0.93 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| epsi[381] | 0.96 | 0.13 | 0.78 | 1.20 | 0.0 | 0.0 | 8343.44 | 2300.39 | 1.00 | 0.93 |
| epsi[382] | 0.97 | 0.14 | 0.79 | 1.21 | 0.0 | 0.0 | 10025.42 | 2777.62 | 1.00 | 0.94 |
| epsi[383] | 1.02 | 0.14 | 0.84 | 1.26 | 0.0 | 0.0 | 8549.79 | 2620.83 | 1.00 | 0.99 |
| epsi[384] | 1.53 | 0.20 | 1.26 | 1.87 | 0.0 | 0.0 | 9032.79 | 2482.91 | 1.01 | 1.48 |
| epsi[385] | 1.28 | 0.18 | 1.06 | 1.58 | 0.0 | 0.0 | 7118.08 | 2146.86 | 1.00 | 1.24 |
386 rows × 10 columns
All the inferred values are now aggregated back. The Bayesian analysis also can tell us if the observations are likely to be close to the true value, or if they are inconsistent with the inferred range of possible emissions values. If an observed value for the source or the country emissions is marked as invalid, we can conclude that this combination is not consistent and that we should refine these values by better analysis.
summ = summ.reset_index()
HDI_LOW = "hdi_3%"
HDI_HIGH = "hdi_97%"
summ["mean"] *= norm_factor
summ["median"] *= norm_factor
summ["sd"] *= norm_factor
summ[HDI_LOW] *= norm_factor
summ[HDI_HIGH] *= norm_factor
summ["s_valid"] = (summ[HDI_LOW] < em_s) & (em_s < summ[HDI_HIGH])
summ["ce_valid"] = (summ[HDI_LOW] < em_ce) & (em_ce < summ[HDI_HIGH])
summ["valid"] = summ["s_valid"] & summ["ce_valid"]
summ["s"] = em_s
summ["ce"] = em_ce
summ[ISO3_COUNTRY] = subs_df[ISO3_COUNTRY]
summ[CT_PACKAGE] = subs_df[CT_PACKAGE]
summ[CT_FILE] = subs_df[CT_FILE]
summ[EMISSIONS_QUANTITY_S]=subs_df[EMISSIONS_QUANTITY_S]
summ[EMISSIONS_QUANTITY_CE]=subs_df[EMISSIONS_QUANTITY_CE]
summ
| index | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | ... | s_valid | ce_valid | valid | s | ce | iso3_country | ct_package | ct_file | emissions_quantity_s | emissions_quantity_ce | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | epsi[0] | 2.769600e+03 | 4.039000e+02 | 2.279150e+03 | 3.433150e+03 | 0.0 | 0.0 | 9968.17 | 2304.78 | 1.00 | ... | True | True | True | 2885.0 | 2.885000e+03 | CHL | manufacturing | chemicals | 2885.0 | 2.885000e+03 |
| 1 | epsi[1] | 3.594240e+03 | 4.867200e+02 | 2.920320e+03 | 4.455360e+03 | 0.0 | 0.0 | 7688.35 | 2447.17 | 1.00 | ... | True | True | True | 3744.0 | 3.744000e+03 | SYR | manufacturing | steel | 3744.0 | 3.744000e+03 |
| 2 | epsi[2] | 6.204120e+03 | 8.954400e+02 | 5.052840e+03 | 7.675200e+03 | 0.0 | 0.0 | 7450.36 | 2345.43 | 1.00 | ... | True | True | True | 6396.0 | 6.396000e+03 | VEN | manufacturing | chemicals | 6396.0 | 6.396000e+03 |
| 3 | epsi[3] | 6.301440e+03 | 9.846000e+02 | 5.185560e+03 | 7.811160e+03 | 0.0 | 0.0 | 9298.34 | 2360.97 | 1.00 | ... | True | True | True | 6564.0 | 6.564000e+03 | VEN | manufacturing | steel | 6564.0 | 6.564000e+03 |
| 4 | epsi[4] | 1.289088e+04 | 1.745640e+03 | 1.047384e+04 | 1.611360e+04 | 0.0 | 0.0 | 8744.27 | 1989.44 | 1.00 | ... | True | True | True | 13428.0 | 1.342800e+04 | TCD | manufacturing | chemicals | 13428.0 | 1.342800e+04 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 381 | epsi[381] | 2.049580e+08 | 2.775473e+07 | 1.665284e+08 | 2.561975e+08 | 0.0 | 0.0 | 8343.44 | 2300.39 | 1.00 | ... | True | True | True | 213497907.0 | 2.134979e+08 | CHN | manufacturing | aluminum | 213497907.0 | 2.134979e+08 |
| 382 | epsi[382] | 2.450190e+08 | 3.536357e+07 | 1.995516e+08 | 3.056423e+08 | 0.0 | 0.0 | 10025.42 | 2777.62 | 1.00 | ... | True | True | True | 252596928.0 | 2.582327e+08 | IND | manufacturing | cement | 252596928.0 | 2.582327e+08 |
| 383 | epsi[383] | 2.694303e+08 | 3.698063e+07 | 2.218838e+08 | 3.328256e+08 | 0.0 | 0.0 | 8549.79 | 2620.83 | 1.00 | ... | True | True | True | 264147340.0 | 2.898534e+08 | CHN | manufacturing | chemicals | 264147340.0 | 2.898534e+08 |
| 384 | epsi[384] | 1.078149e+09 | 1.409345e+08 | 8.878876e+08 | 1.317738e+09 | 0.0 | 0.0 | 9032.79 | 2482.91 | 1.01 | ... | False | True | False | 704672700.0 | 1.167219e+09 | CHN | manufacturing | cement | 704672700.0 | 1.167219e+09 |
| 385 | epsi[385] | 1.181403e+09 | 1.661348e+08 | 9.783493e+08 | 1.458294e+09 | 0.0 | 0.0 | 7118.08 | 2146.86 | 1.00 | ... | False | True | False | 922971074.0 | 1.287029e+09 | CHN | manufacturing | steel | 922971074.0 | 1.287029e+09 |
386 rows × 21 columns
Here are all the estimated ranges of emission values for the considered sectors.
Looking at the top sector (Chinese manufacturing), our Bayesian analysis inferred an median emission of 1.14BT CO2e, and the true value being very likely between 0.96BT and 1.4BT (94% confidence interval). However, the value of the source emission is 0.9BT. That value is likely to be inaccurate and we should explore further if we can refine it.
You should zoom and hover over the data points to see how much the values diverge.
px.scatter(
summ,y="median",
color="valid",
log_y=True,
error_y = summ[HDI_HIGH]-summ["median"],
error_y_minus=summ["median"]-summ[HDI_LOW],
hover_data=[CT_PACKAGE,ISO3_COUNTRY,CT_FILE,EMISSIONS_QUANTITY_S,EMISSIONS_QUANTITY_CE]
)
This is another visualization that puts all the data in the same plot. You can see how this simple model shines for estimating conservative figures: between two wildly different values, it will lean towards the higher one, and when they are close, it will average them out.
fig = go.Figure()
xs = list(range(len(summ)))
fig.add_trace(
go.Scatter(
x=xs,
y=summ["median"],
name="estimate",
mode='markers',
error_y=dict(
type='data',
symmetric=False,
array=summ[HDI_HIGH]-summ["median"],
arrayminus=summ["median"]-summ[HDI_LOW])
)
)
fig.add_trace(
go.Scatter(
x=xs,
y=summ["s"],
name="source",
mode='markers',
)
)
fig.add_trace(go.Scatter(
x=xs,
y=summ["ce"],
name="countries",
mode='markers',
))
fig.update_yaxes(type="log")
fig.show()
Conclusion#
In this section, we first saw how to visually inspect the differences between emissions of different countries, both from a source level and country report level. We identified a couple of sectors in which emissions diverge between both methods of reporting.
Finally, we saw how to use Bayesian analysis to start combining multiple data sources together. This example only considers two estimates. See as an exercise what happens with different distributions, such as Gaussian distributions for example.